Topic Segmentation with Hybrid Document Indexing
Abstract
We present a domain-independent, unsupervised topic segmentation approach based on hybrid document indexing. Lexical chains have been successfully employed to evaluate the lexical cohesion of text segments and to predict topic boundaries. Our approach is instead based on the notion of semantic cohesion. It uses spectral embedding to estimate the semantic association between content nouns over a span of multiple text segments. Our method significantly outperforms the baseline on the topic segmentation task and achieves performance comparable to state-of-the-art methods that incorporate domain-specific information.
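To make the idea of predicting topic boundaries from cohesion concrete, the following is a minimal sketch of a plain lexical-cohesion baseline. It is not the spectral-embedding method described in the abstract: it scores each gap between sentences by the cosine similarity of the word counts on either side and marks a boundary where the score falls below a threshold. The function names, the window size, and the threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion_scores(sentences, window=2):
    """Score every gap between adjacent sentences.

    scores[i] is the similarity between the (up to) `window` sentences
    ending at sentence i and the (up to) `window` sentences starting at
    sentence i + 1. Tokenization is a naive whitespace split (assumption).
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def predict_boundaries(sentences, window=2, threshold=0.1):
    """Return indices of sentences that start a new segment.

    A boundary is placed wherever cohesion across the gap drops below
    `threshold` (a hypothetical, untuned value).
    """
    scores = cohesion_scores(sentences, window)
    return [i + 1 for i, score in enumerate(scores) if score < threshold]
```

For example, given the four sentences "cats chase mice", "mice fear cats", "stocks fell sharply", and "markets and stocks recovered", the gap between the second and third sentences shares no words on either side, so `predict_boundaries` places a boundary at index 2. A semantic-cohesion approach like the one in the abstract would replace the raw word counts with a measure of association between content nouns.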
Similar Resources
Persian Printed Document Analysis and Page Segmentation
This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In the low-resolution stage, a pyramidal image structure is constructed for multiscale analysis and the document image is segmented into a set of regions. In the high-resolution stage, connected-components analysis segments each region into homogeneous regions and identifyi...
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian e-books in the field of science using LDA topic modeling, evaluating the extracted keywords' similarity to the gold standard and users' views of the model's keywords. Methodology: This is a mixed text-mining study in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Joint Segmentation and Clustering in Text Corpuses
In recent years, many private corporations and government organizations have digitized corpuses of legacy paper documents. Often, these organizations hope to take advantage of digital representations to transform costly manual tasks associated with paper archives into less-costly computer-assisted tasks. The most common approach toward automated information extraction is through inverted indexi...
Segmentation of Color Documents by Line Oriented Clustering using Spatial Information
In this contribution we introduce a new method for global segmentation of color documents with a structure based on text frames and pictures. It is based on an extensive analysis of the expected shape of clusters in RGB-color space. The method provides an improved segmentation, and gives a proper basis for indexing and layout analysis. Results are very promising. keywords: color documents, docu...
Hierarchical segmentation using latent semantic indexing in scale space
This paper describes a new algorithm which discovers the hierarchical organization of a document or media presentation. We use latent semantic indexing to describe the semantic content of the signal, and scale-space segmentation to describe its features at many different scales. We present results from a text document and a video transcript.